verify that each digit spoken into the system was correctly recognized. Errors
can  be  corrected  through  the  use  of  the  control  words. 

To confirm system performance, several final tests were held, two of which
included live inputs rather than tape recordings. Individual digit recognition
accuracy in each of two tests from magnetic tape was 98.7 percent for a total
of 65 speakers. In the live tests a total of 30 speakers each spoke into the
system 75 groups of digits, each group consisting of four digits followed by
the word VERIFY to simulate operational conditions. Individual digit accuracy
in these tests was 97.9 percent for 30 speakers. Approximately 92.5 percent of
all digit groups were inputted and verified without error. The remaining
groups were corrected and properly entered. With feedback verification and
error correction all talkers were able to enter all digit groups correctly.

Most codes, together with the verify command, were entered in four to seven
seconds when no errors were detected. Typically, 10 to 12 seconds were required
to observe and correct a digit error and enter the corrected code.

The VICI system is based upon the VIP-100 isolated word recognition system which
normally requires the input of training data by each talker who uses the system.
For use in the VICI application both hardware and software modifications were
made to a VIP-100 system to allow recognition of the VICI vocabulary spoken
by a large population of General American males without adaptation or training
by any individual speaker.
EVALUATION


This report represents a major achievement in the area of automatic
speech processing. It proved that it is possible to achieve high
word recognition scores in real time using a limited vocabulary with
words spoken in a discrete manner and independent of speaker for male
speakers regardless of geographic accent. With the aid of visual feedback,
all errors could be corrected, thus ensuring proper data
entry into the machine.

Because of the success of this program, many practical applications
are now emerging. For example, the Voice Input Code Identifier (VICI)
will be used in conjunction with the ESD Base and Installation Security
System's "Automatic Speaker Verification" (ASV) system. The ASV system,
which was developed by RADC, uses the voice characteristics of an
individual as a means of authenticating him for entry control. Presently,
the ASV system requires the individual to identify himself with a four
digit code by an input device such as a keyboard or badge reader. VICI
will eliminate the need for such input devices and will allow an individual to
"speak" his code numbers as a means of identifying himself to the
verification system.

In addition, this word recognition technology will be transitioned
into a natural USAF application. A voice actuated system will supply
pertinent information to a computer as an aid for cartographers. Present
mapping techniques require a cartographer to position an X-Y reader
device over a smooth sheet, read the required bathymetric numbers from the
map and then enter these digits into a computer which correlates them with
the positioning device. This process of turning away from the table to
enter numbers via the manual keyboard diverts the operator's attention
and tends to slow down the data entry process. By utilizing a word
recognizer, the operator can speak the required digits and enter them
automatically into the computer without losing sight of the manuscript.

The voice system is more efficient in that it will reduce the data entry
time from its present average of 12 seconds to an average of 3 seconds.

These applications and others will ensure that voice-controlled
devices have a valuable role in future information processing systems.

RICHARD  S.  VONUSA 
Project  Engineer 
BACKGROUND AND INTRODUCTION

The application of an automatic speech recognition (ASR) system as a
front-end for the Base and Installation Security System's (BISS) automatic
speaker verification system can provide a more reliable means of entering
speaker verification data. An automatic speaker verification experimental
model was fabricated under RADC Contract F30602-72-C-0294. To use this
verification system it was first necessary for an individual to manually enter,
via a keyboard, a sequence of digits to alert the system as to his identity.
This manual data entry can now be eliminated by the use of the "Voice Input Code
Identifier" (VICI) system which has been developed during the contract
described in this report. The combination of the VICI system and the speaker
verification system can provide implementation of a fully automatic voice
oriented technique to allow an individual requesting base entry to claim
identity and be verified. Thus, the need for picture badges, the keypunching of
code numbers and other fallible mechanical methods by which an individual claims
his valid identity will be eliminated.

The VICI system has been developed to recognize with very high accuracy
the English digits zero through nine, plus the control words CANCEL, ERASE,
VERIFY and TERMINATE, independent of speaker for a large population of General
American males. A feedback system has been incorporated to allow the speaker
to verify each digit entry and, if necessary, to correct a faulty entry by the
use of the control word ERASE and then enter a new digit. A complete code
group of four digits can be accepted by the use of the control word VERIFY or
rejected by the word CANCEL. The speaker can view on an alphanumeric display
each recognized digit within 0.1 to 0.2 seconds after it is pronounced in order
to verify the correctness of each digit entry. Live tests involving a total
of 30 speakers showed that a four digit group could be entered into the VICI
system with verification in as short an interval as 2.8 seconds. Four to
seven seconds were typically required for most speakers for a digit group if
no errors were made either by the speaker or the system. Ten to 12 seconds
were required by most speakers to detect and correct an error and complete the
entry of a proper code. It was necessary to employ correction for an average
of 7.5% of the 75 digit groups spoken by the 29 participants in the live tests.
In every case the errors were correctable and every code was entered properly.
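
The entry protocol described above can be sketched as a small state machine. This is only an illustration under stated assumptions, not the VICI software: the class name `CodeEntry` and its methods are hypothetical, ERASE is assumed to remove only the most recent digit, VERIFY is assumed to be ignored until four digits are present, and TERMINATE is assumed to end the session (the text does not define its behavior).

```python
from typing import List

# Spoken digit words mapped to display characters.
DIGITS = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9",
}
GROUP_LENGTH = 4  # a complete code group is four digits


class CodeEntry:
    """Hypothetical model of digit entry with control words."""

    def __init__(self) -> None:
        self.buffer: List[str] = []    # digits currently shown on the display
        self.accepted: List[str] = []  # code groups accepted with VERIFY
        self.terminated = False

    def hear(self, word: str) -> str:
        """Process one recognized word and return the text to display."""
        if self.terminated:
            return "".join(self.buffer)
        if word in DIGITS:
            # Assumption: digits beyond the fourth are ignored.
            if len(self.buffer) < GROUP_LENGTH:
                self.buffer.append(DIGITS[word])
        elif word == "ERASE":
            # Assumption: ERASE removes the most recent digit only.
            if self.buffer:
                self.buffer.pop()
        elif word == "CANCEL":
            self.buffer.clear()  # reject the whole group
        elif word == "VERIFY":
            if len(self.buffer) == GROUP_LENGTH:
                self.accepted.append("".join(self.buffer))
                self.buffer.clear()
        elif word == "TERMINATE":
            self.terminated = True  # assumption: ends the session
        return "".join(self.buffer)
```

A speaker who sees a misrecognized second digit would say ERASE, repeat the digit, finish the group and then say VERIFY; the corrected group is what ends up in `accepted`.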

In addition to the live tests which were conducted just prior to and at
the time of delivery to RADC of the VICI equipment, several test series were
conducted by the use of magnetic tape recordings of a total of 65 male speakers,
only 11 of whom were used subsequently in the live tests. The speakers
who made the tape recordings over a period of several months ranged in age
from 16 years to 65 years. The majority of these speakers were in the 20 to
40 year age bracket. Overall, therefore, the VICI system has been tested by
83 male speakers.

The VICI system developed for this contract is based upon the Threshold
Technology Inc. (TTI) commercial VIP-100 limited vocabulary isolated word
recognition system. The VIP-100 normally requires training (adaptation) by
each talker using it. This training is accomplished by inputting five to 10
samples of each vocabulary word by each user. The VIP-100 which served as
the basis for the VICI system was modified in both hardware and software to
allow operation without the necessity of entering any training data for each
speaker.


The VIP-100 system includes a speech preprocessor and a minicomputer,
the Nova 1200 manufactured by Data General, which includes 8K of core memory.
For verification, a display module based upon a Burroughs Self-Scan alphanumeric
display panel is included. The display has a 32 character memory and is
capable of displaying 16 characters at a time. The microphone used in the
development of the VICI system is a Telex model 1200 which is a [illegible]
unit. An ASR 33 Teletype has been supplied for control and data input/output
functions. Figure 1 is a photograph of a VIP-100 system.


Section II of this report describes the basic approaches to speech recognition
which led to development of the VIP-100, together with a description of
the operating principles of the VIP-100. Next, the development of universal
talker data characteristics is discussed. Experiments with single-repetition
training samples are described, followed by a description of the hardware and
software modifications necessary to accommodate a large talker set without
individual training. A description and the results of final system tests, both
live and from tape, are included in Section III. Conclusions and recommendations
are listed in Section IV.
Figure 1. VIP-100 automatic speech recognition system.


Section  II 

TECHNICAL  DISCUSSION 


A.  Introduction 

In order to best meet the requirements of this program in the development
of an Advanced Development Model, a VIP-100 word-recognition system was
modified in [illegible] software for this application. The VIP-100 system was
previously developed by TTI for commercial [illegible] system with adaptation
[illegible] as a requirement of operation. In the following [illegible] the
development of the VIP-100 are presented, followed by an outline of the
operation of a VIP-100. The [illegible] hardware and software modifications
[illegible] adaptation is then explained. The VIP-100 system supplied to this
program includes a speech preprocessor, a Nova 1200 minicomputer manufactured
by Data General Corporation with 8K of core memory, and a 16 character
alphanumeric display module. A Telex model [illegible] input/output functions.

B. Basic Approaches to Automatic Speech Recognition
Figure 2. Pattern recognition process.
[...] transducer, a preprocessor, feature extractor and a final decision level
classifier. Early attempts at automatic speech recognition either deleted
entirely the feature extraction process or utilized a simplified form of
template matching. Experience with template matching soon led to the
realization of its limitations. Slight variations of the individual speech
samples of a particular word would result in gross misclassifications. This
limitation resulted in the impractical requirement for a large memory
containing a pattern and all its prototypes.

Considerable mathematical formalism has been developed for various automatic
speech recognition processes. However, no general theory exists which
can preselect the information bearing portions of the speech signal. Therefore,
the design of the feature extractor is heuristic and must use ad hoc
strategy. Only actual experimental data can determine the value of a particular
feature set. It is this particular dilemma which has resulted in the
recent increased emphasis given feature extraction research for pattern
recognition systems.

It is possible to form many transformations of the speech signal which
would enhance certain properties and make them more easily detectable in an
automatic speech recognition system. However, speech is neither periodic nor
aperiodic, but must be considered as a quasi-periodic signal so that analytical
techniques that are developed must reflect temporal features of significance
as well as spectral features. Maintaining this dual viewpoint throughout
the analysis requires a modification of classical time-domain and
frequency-domain analytical techniques. To retain both of these characteristics
in a frequency analysis, a method which produces a short-duration spectrum is
essential.

Frequency-domain representation of the speech signal is particularly
advantageous since (1) it is known that the human auditory system performs a
crude frequency analysis at the periphery of auditory sensation and (2) it
has been shown, by acoustical analysis of the vocalization system,
that an exact description of the speech sounds can be obtained with a natural
frequency concept model of speech production.

A periodic function of time possesses a power spectrum with finite amounts
of power located at discrete points in the spectrum, commonly described as a
line spectrum. An aperiodic function that contains finite energy and is
Fourier-transformable possesses an energy density spectrum that is a continuous
function of frequency. For analyzing speech signals, it is desirable to obtain
the spectral energy distribution and its variations as a function of time.
Sufficient resolution must be maintained in both the frequency and time
domains so that all of the information-bearing properties in both domains can
be detected.

Spectrum analysis can be achieved by direct analog circuitry, through
the use of the Fast Fourier Transform (FFT) and a high speed digital computer,
or by the use of linear predictive analysis. In all of these methods, equivalent
problems occur. The FFT produces a discrete spectrum which, with a
sufficiently high sampling rate, approaches that of the continuous Fourier
Transform. Many different types of data windows have been utilized in the
FFT. The choice of the window is similar to the choice of the filter response
in the analog spectrum analyzer. A "picket fence" effect can occur both in
the FFT and the analog spectrum analyzer, representing the contributions of the
individual filters in the analog analyzer or the separate coefficients of the
various terms in the FFT calculation. Analogous problems are introduced using
linear predictive analysis in the selection of the number of coefficients
employed in the process. In all cases, however, spectrum analysis is only the
first step in the feature extraction process. Considerable additional
processing is required in order to achieve the detection and recognition of the
information-bearing elements (significant features) of the speech signal which
has been transformed to accentuate these elements in the spectrum analysis
process.
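
The "picket fence" effect mentioned above can be illustrated with a short numerical sketch. It is not taken from the report: it uses a plain discrete Fourier transform with a rectangular window, and the frame length of 64 samples and the helper names `dft_magnitudes` and `tone` are arbitrary choices for illustration. A tone centered on an analysis bin shows its full amplitude, while a tone halfway between two bins shows a peak on the order of 4 dB lower.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of the first half of a direct DFT, scaled so that a
    unit-amplitude tone centered on a bin reads 1.0."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc) / (n / 2))
    return mags


def tone(cycles_per_frame, n=64):
    """Unit-amplitude cosine with the given number of cycles in the frame."""
    return [math.cos(2 * math.pi * cycles_per_frame * i / n) for i in range(n)]


# Tone exactly on bin 8 versus a tone halfway between bins 8 and 9.
on_bin = max(dft_magnitudes(tone(8.0)))
between_bins = max(dft_magnitudes(tone(8.5)))
loss_db = 20 * math.log10(between_bins / on_bin)
```

The value of `loss_db` is the apparent drop in peak level caused only by where the tone falls relative to the analysis bins, which is why the window (or filter response) has to be chosen with care.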


The final processing level after the recognition of the elemental speech
units is the word decision logic. For isolated words, it is possible to examine
the phonetic sequences produced by a feature extractor and to determine
the closest match to a set of stored reference samples. The decision involving
the closest match is made at the end of the word and can be achieved with
relatively simple processing techniques. These reference samples can be
obtained from a particular talker as in a trainable speech recognition system
or can be universal samples as have been developed during this investigation.


The VIP-100 speech recognition system, designed by TTI as a general purpose
speech recognizer, has served as the basis for the VICI advanced development
model developed during this contract. The VIP-100 employs the processing
functions just described.


C.  Description  of  the  VIP-100 

The VIP-100 was originally designed to recognize a vocabulary, essentially
unrestricted in content (but restricted in size by the storage limitations
of the core memory of the associated minicomputer), with automatic adaptation
for individual speakers and words. This system has been modified to allow
the recognition of a specific fixed vocabulary by an unlimited speaker set
without adaptation to individual speakers. The modifications which were made
to the VIP-100 are described in Section II.F.

Operation of the basic VIP-100 is described in the following paragraphs.
Figure 3 is a block diagram of the system as originally designed. Both the
preprocessor and feature extractor functions are hardwired. The classifier
function is performed by software in a Data General Nova 1200 minicomputer.
The minicomputer also time normalizes word durations and provides core storage
of the reference patterns for each word in the vocabulary.


1.  Preprocessor 

The initial section of the preprocessor shapes the output from the
microphone to remove irregularities and produce a normalized speech spectrum.
This equalized signal is then passed through a real-time spectrum analyzer
consisting of a bank of 19 contiguous active bandpass filters ranging in
center frequency from 260 Hz to 7626 Hz. The outputs of the filters are
full-wave rectified and logarithmically compressed. This latter operation
provides a 50 dB dynamic range and produces ratio measurements when subsequent
features are derived from summation and differencing operations, thereby
minimizing the input amplitude dependence.
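
The reason logarithmic compression reduces amplitude dependence can be shown with a small numerical sketch. The channel values and the function names `log_compress` and `adjacent_differences` are invented for illustration; only the principle, that a difference of logarithms is the logarithm of a ratio, comes from the description above.

```python
import math


def log_compress(channel_amplitudes, floor=1e-6):
    """Convert rectified filter outputs to decibels.
    The floor is an assumed guard against taking the log of zero."""
    return [20 * math.log10(max(a, floor)) for a in channel_amplitudes]


def adjacent_differences(levels_db):
    """Differences between neighboring channels. In the log domain these
    are ratios of the original amplitudes: log(a) - log(b) = log(a / b)."""
    return [b - a for a, b in zip(levels_db, levels_db[1:])]


# Hypothetical rectified outputs of four neighboring filters.
quiet = [0.010, 0.040, 0.020, 0.005]
# The same spectrum spoken 20 dB louder (every channel scaled by 10).
loud = [10 * a for a in quiet]

diff_quiet = adjacent_differences(log_compress(quiet))
diff_loud = adjacent_differences(log_compress(loud))
```

Both difference lists agree to within rounding error, so features built from such differences do not change when the talker speaks louder or softer, provided the signal stays inside the compressor's dynamic range.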

2.  Feature  Extractor 

The function of the spectral shape detector is to develop spectral
derivative (dE/df) features indicating the overall spectrum shape. The spectral
shape and its changes with time are continuously measured over the frequency
range of interest. Combinations and sequences of these measurements
are processed to produce a set of significant acoustic features.

The features used in the VIP-100 are a selected subset (including
complex combinations) of 32 acoustic features. Each feature is extracted by
a combination of analog operations and binary logic. The output of the feature
extractor consists of 32 binary signals, F1, F2, ..., F32.

The  features  are  of  two  types,  primary  features  and  phonetic-event 
features.  Features  of  the  former  category  describe  the  spectrum  directly  by 
indicating  local  maxima  and  areas  of  increasing  or  decreasing  energy  with 
frequency  (slopes).  The  latter  category  consists  of  features  which  represent 
measurements  corresponding  to  phoneme-like  events.  Included  in  this  set  are 
vowels,  nasals  and  fricatives. 
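
As a rough illustration of the primary features, the sketch below marks local maxima and rising or falling slopes across a set of log-compressed channel levels. The VIP-100 derives its features with analog circuitry and binary logic whose details are not given here, so the function name `primary_features` and the 3 dB slope threshold are purely hypothetical.

```python
def primary_features(levels_db, slope_threshold_db=3.0):
    """Return three lists of booleans describing the spectrum shape.

    peaks[i]   : channel i is higher than both of its neighbors
    rising[i]  : energy increases from channel i to channel i + 1
    falling[i] : energy decreases from channel i to channel i + 1
    The slope threshold is an assumed value, not one from the report.
    """
    n = len(levels_db)
    peaks = [False] * n
    for i in range(1, n - 1):
        peaks[i] = levels_db[i] > levels_db[i - 1] and levels_db[i] > levels_db[i + 1]
    rising = []
    falling = []
    for i in range(n - 1):
        delta = levels_db[i + 1] - levels_db[i]  # finite-difference dE/df
        rising.append(delta >= slope_threshold_db)
        falling.append(delta <= -slope_threshold_db)
    return peaks, rising, falling
```

Phonetic-event features would then be formed from combinations and time sequences of binary outputs like these; that stage is not sketched here.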

3.  Minicomputer  Functions 

The minicomputer performs the functions shown in Figure 3. For a
spoken word, the 32 encoded features and their time of occurrence are stored
in a short term memory. When the end of the utterance is detected by the
feature-extractor logic, the duration of the word is divided into 16 time
segments and the features are reconstructed into a normalized time base. The
pattern-matching logic subsequently compares these feature occurrence patterns
to the stored reference patterns for the various vocabulary words and determines
the "best fit" for a word decision. 512 bits of information (32 features
mapped into 16 time segments) are required to store the feature array
of an utterance or reference pattern.
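
The time normalization step can be sketched as follows. The event representation (a feature index paired with a time of occurrence) and the function names are assumptions made for illustration; the 32 x 16 array size and the 512-bit total come from the text.

```python
NUM_FEATURES = 32
NUM_SEGMENTS = 16


def normalize_word(events, word_duration):
    """Map (feature_index, time) events onto a 32 x 16 binary array.

    Each event time, with 0 <= time < word_duration, is assigned to one of
    16 equal segments of the word, so long and short utterances of the same
    word produce arrays on a common time base.
    """
    array = [[0] * NUM_SEGMENTS for _ in range(NUM_FEATURES)]
    for feature, t in events:
        segment = min(int(t * NUM_SEGMENTS / word_duration), NUM_SEGMENTS - 1)
        array[feature][segment] = 1
    return array


def pack_bits(array):
    """Pack the array row by row into one integer of 32 * 16 = 512 bits."""
    bits = 0
    for row in array:
        for bit in row:
            bits = (bits << 1) | bit
    return bits
```

For example, an event at 0.25 s in a 0.5 s word lands in segment 8, and the same event at 0.5 s in a word spoken twice as slowly lands in the same segment.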

4.  Training 

The training mode of the operation is a necessary prelude to the
normal operation of a VIP-100 system when the system is used as a word
recognition system which is adaptable to individual talkers. The VIP-100 which
has been modified for VICI use with a universal speaker set does not normally
require training (adaptation) by a particular talker. However, the ability
to be adapted to or trained for each speaker has been retained in the VICI
VIP-100. Furthermore, a series of experiments was conducted involving the
use of single training samples for certain digits for increased accuracy.
These experiments are described in Section II.E of this report. During the
training mode of a conventional VIP-100, or the VICI VIP-100, a time-normalized
feature array is extracted for each repetition of a given word. A consistent
array of feature occurrences (between repetitions) is required before
the features are stored in the reference pattern memory. A template threshold
factor is chosen such that a feature occurrence (in a given time segment) is
considered valid only when it occurs a minimum number of times relative to the
number of training samples. Usually, this threshold factor is set to be between
30 and 50% of feature occurrences within the training samples.
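
The template threshold can be sketched as a simple vote over the training repetitions. This is a minimal interpretation of the description above, assuming that a point is kept when the fraction of repetitions containing it reaches the threshold factor; the function name `train_reference` and the default of 0.4 are illustrative choices.

```python
def train_reference(samples, threshold=0.4):
    """Build a binary reference array from time-normalized training arrays.

    samples   : list of equally sized binary arrays (lists of rows)
    threshold : assumed fraction of repetitions (0.3 to 0.5 in the text)
                in which a point must occur to be stored
    """
    n = len(samples)
    rows = len(samples[0])
    cols = len(samples[0][0])
    reference = [[0] * cols for _ in range(rows)]
    for f in range(rows):
        for s in range(cols):
            occurrences = sum(sample[f][s] for sample in samples)
            if occurrences >= threshold * n:
                reference[f][s] = 1
    return reference
```

A point seen in three of five repetitions is kept at a 40% threshold, while a point seen only once is treated as inconsistent and dropped.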

5.  Recognition  Mode  of  Operation 

In the operational mode, each new word spoken into the system is processed
in a manner analogous to the training procedure, i.e., feature extracted,
digitized and time normalized. The resultant test word array then is compared
digitally to the stored reference array for each vocabulary word.
Similarities and dissimilarities in each compared array are appropriately
weighted and the net result provides a weighted correlation product.
Correlation products also are generated after shifting the input word array
±1 time segment. The stored reference word array producing the highest overall
correlation is selected as the test word. This decision is then displayed to
the speaker in an appropriate manner for verification of accuracy.
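
The matching step can be sketched as below. The actual weights used by the VIP-100 are not stated, so the values here are assumptions: a matching point scores +2, chosen so that an array correlated with itself scores twice its number of points (consistent with the self-correlation described later in this report), and a mismatch scores -1. The function names are hypothetical.

```python
def shift(array, offset):
    """Shift every feature row by `offset` time segments, padding with zeros."""
    n = len(array[0])
    shifted = []
    for row in array:
        if offset > 0:
            shifted.append([0] * offset + row[:n - offset])
        elif offset < 0:
            shifted.append(row[-offset:] + [0] * (-offset))
        else:
            shifted.append(list(row))
    return shifted


def correlation(test, reference, match_weight=2, mismatch_weight=1):
    """Weighted correlation product with assumed weights."""
    score = 0
    for test_row, ref_row in zip(test, reference):
        for t, r in zip(test_row, ref_row):
            if t and r:
                score += match_weight      # point present in both arrays
            elif t != r:
                score -= mismatch_weight   # point present in only one array
    return score


def recognize(test, references):
    """Return (word, score) for the best reference over shifts of -1, 0, +1."""
    best_word, best_score = None, None
    for word, reference in references.items():
        score = max(correlation(shift(test, o), reference) for o in (-1, 0, 1))
        if best_score is None or score > best_score:
            best_word, best_score = word, score
    return best_word, best_score
```

The ±1 segment shift gives the match some tolerance to small errors in locating the beginning and end of the word.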

D.  Development  of  a Universal  Reference  Array  Set 

As previously mentioned, an important preliminary phase in the operation
of the VIP-100 system is adaptation of the system for the voice of a particular
user by means of inputting training samples of each word in the vocabulary.
During this adaptation or training phase, each vocabulary word is pronounced
by the user from one to 10 times. Usually, 10 repetitions of each word in the
training phase are used in order to assure maximum recognition accuracy. It
has been observed, however, that a single training word for each vocabulary
word is adequate for good accuracy if the training words are spoken in close
time proximity to the test data inputs. Therefore, a possible mode of operation
of the VICI system would be to input the VICI four-digit code, preceded
by a complete training phase with a single word training sample. However,
such a procedure is undesirable from an operational standpoint because of the
time required. Experience has shown that at least one second per word would
be necessary during the training and recognition phases with naive speakers
who would typically use such a system in the field. Therefore, at least 20
seconds would be required for the training of the 14 word vocabulary of digits
plus control words as well as inputting the four digit code number and using
one or two additional control words. A realistic limit of 10 seconds for
inputting the entire message including any training and verification was
established for the system by RADC. Therefore, it became obvious early in the
VICI program that the development of prototype reference arrays which are
representative of large groups of speakers would be necessary in order to
achieve the required recognition accuracy. A minimal training period of five
seconds allowed in the specification for the advanced development model could
be used for single sample training of two or perhaps three digits which were
the most difficult to recognize accurately with universal reference arrays.
Experiments with the use of single digit training samples for certain digits
will be discussed later.
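
The timing argument above amounts to simple arithmetic, sketched here using the figure of one second per word quoted in the text as a lower bound. The function name `entry_time` and the partial-training case of three digits with one control word are illustrative choices.

```python
SECONDS_PER_WORD = 1.0   # lower bound quoted for naive speakers
VOCABULARY_WORDS = 14    # ten digits plus four control words
CODE_DIGITS = 4
LIMIT_SECONDS = 10.0     # limit for the entire message


def entry_time(training_words, control_words):
    """Minimum time to train on some words, speak the code and confirm it."""
    return (training_words + CODE_DIGITS + control_words) * SECONDS_PER_WORD


# Single-sample training of the whole vocabulary, then code plus two controls.
full_training = entry_time(VOCABULARY_WORDS, control_words=2)
# Single-sample training of three difficult digits, then code plus VERIFY.
partial_training = entry_time(3, control_words=1)
```

Full training gives (14 + 4 + 2) x 1 s = 20 s, twice the limit, while training on three digits gives (3 + 4 + 1) x 1 s = 8 s, which fits within it.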

1.  Alternate  Reference  Arrays 

Several different approaches were explored in the attempt to develop
an optimum reference array set for universal speaker use. The first of these
approaches involved the use of alternate reference arrays for each vocabulary
word chosen such that each array represented a wide variety of expected
pronunciations for each word. In many of the commercial applications in which
the VIP-100 has been used, it has been noted that a particular talker has been
able to achieve highly accurate recognition for a large number of words,
especially digits, when using another speaker's stored reference arrays. Often
when this phenomenon occurs the two speakers have been found to have been
raised in the same geographical area. Therefore, their pronunciation of words
is similar and, insofar as the ASR system is concerned, they are essentially
identical. It should be possible, by storing alternate sets of prototype
reference arrays for each of the required vocabulary words, to accommodate a
large group of talkers from different geographic areas and to achieve good
recognition accuracy without additional training or adaptation for any
individual speaker using the system.

In order to conduct initial experiments with the use of alternate
reference arrays, the 14 word VICI vocabulary was recorded on audio tape by a
total of 20 talkers. Each talker repeated each vocabulary word ten times as
he would in a normal VIP-100 training phase. A set of special purpose versions
of the general VIP-100 training and recognition computer programs was
constructed in order to allow data from the VIP-100 preprocessor to be inputted
to TTI's real-time disk operating system (RDOS). The use of a disk memory in
conjunction with a digital computer provided for the storage of large amounts
of training data in a convenient form. This special experimental software
also was designed to produce correlation score matrices for a variety of
conditions of talkers and word combinations from the data stored on disk.

These correlation scores were calculated in the same manner as the correlation
products used to choose the proper word in the recognition mode of operation
previously described. In the recognition mode, however, the correlation products
represented a comparison of reference word arrays stored by the talker using
the system in the training phase with the word array resulting from an unknown
input word spoken by the speaker who had trained the system. For these
experiments, the correlation products resulted from comparisons of the same word
as spoken by various talkers, or different words spoken by the same talker, or
different words spoken by different talkers. The matrices formed from these
correlation products effectively allowed comparisons of pronunciations of
various words and illustrated the similarity and dissimilarity of different
words. Initially, correlation matrices for each of the 14 words in the VICI
vocabulary were constructed. Figure 4 illustrates an abbreviated matrix for 10
talkers for the digit zero. The matrix in the figure shows the correlation
scores which were calculated when the training data array for each of the 20
talkers for the word zero was compared with that of each of the other talkers
for the digit zero. The on-diagonal elements of the matrix are equivalent to a
self correlation which, as a consequence of the algorithm which computes
correlation products, is simply two times the total number of points in the
32 x 16 array generated by the particular talker.

Figure 4. Correlation scores for 10 talkers for digit "zero".

These matrices were then examined to determine which talkers exhibited the
best training data correlation with other talkers for each word and which
talkers showed the poorest correlation for that word. Training data for the
five talkers representing the best and the worst correlations were chosen for
each of the 14 vocabulary words. A reference data set was then established by
the use of these choices. Each vocabulary word in the set of 14 then had five
alternative reference samples from five different talkers. A test data base
was recorded by 19 talkers of the 20 who originally recorded the training data
base. This test data base, recorded approximately two weeks after the training
data recordings were made, included a total of 20 repetitions of each of the
14 vocabulary words spoken in a random order. Table 1 illustrates the test
data set used in this and subsequent tests. The list, which contains 140 words,
was read two times.
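
A talker-by-talker correlation matrix of the kind described above, and a ranking of talkers by how well each correlates with the others, can be sketched as follows. The scoring weights (+2 for a shared point, -1 for a point present in only one array) are assumptions chosen so that the diagonal equals twice the number of points, as the text states; how the report combined "best" and "worst" talkers into five choices is not specified, so only the ranking is shown.

```python
def correlation(a, b):
    """Assumed scoring: +2 for a shared point, -1 for an unshared point."""
    score = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            if x and y:
                score += 2
            elif x != y:
                score -= 1
    return score


def correlation_matrix(arrays):
    """arrays maps talker initials to that talker's array for one word."""
    talkers = sorted(arrays)
    matrix = [[correlation(arrays[p], arrays[q]) for q in talkers] for p in talkers]
    return talkers, matrix


def rank_talkers(arrays):
    """Order talkers from best to worst mean correlation with the others."""
    talkers, matrix = correlation_matrix(arrays)
    means = {}
    for i, talker in enumerate(talkers):
        off_diagonal = [matrix[i][j] for j in range(len(talkers)) if j != i]
        means[talker] = sum(off_diagonal) / len(off_diagonal)
    return sorted(talkers, key=lambda t: means[t], reverse=True)
```

Talkers at the top of the ranking have pronunciations typical of the group, while those at the bottom represent pronunciations that a single typical reference would miss.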

2.  Merging  Reference  Arrays 

Preliminary tests involving the use of five different speakers for
each word disclosed recognition problems with a few talkers on certain words.
Most of these recognition problems were associated with talker and word
combinations not represented in the training set. Therefore, merging of
training data was tested next. Five training samples of each word from each
talker not previously included in the training set were merged with five
samples of training data from a talker previously included in the training set.
This merging of training data was accomplished simply by the use of a
conventional VIP-100 software training routine in which the computer was
instructed to input ten samples of each word as training data and form a
reference matrix from those ten training samples, as is usually done in a
commercial VIP-100. Recognition accuracy using these merged multiple
representations improved to 96.9 percent correct recognition of 280 words from
Table 1 as spoken by each of 14 talkers, all of whom were also included in the
training data set.

Figure 5 illustrates such a reference array (for the word "Erase")
which has been generated in this manner from the two talkers, MH and EC.
Individual reference arrays for "Erase" for these two speakers are shown in
Figure 6. The points in Figure 5 which are encircled were contributed by one or
the other but not both of the talkers. All other points appeared in the
individual reference arrays for each of the talkers. In a few cases (not shown)
points which appeared for one or the other talker were not included in the
reference array because they did not meet the threshold criteria established.
Since it is possible to increase the number of training samples required to
generate a reference matrix, it should be possible to merge a multiplicity
of talkers speaking the same word in order to obtain an overall average
pronunciation for that word. Experiments with the merging of reference arrays
from a multiplicity of talkers were conducted at a subsequent time and will
be discussed later.
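
A minimal sketch of such a merge follows. The actual threshold criteria of the
VIP-100 training routine are not stated in this report, so the sketch assumes
a simple rule: a point enters the reference array when it is present in at
least a given fraction of the training samples. The function name, the
fraction and the small arrays are hypothetical.

```python
def merge_reference(samples, threshold=0.5):
    """Merge binary training arrays into one reference array.

    A point is set in the result when the fraction of samples containing
    it is at least `threshold` (an assumed criterion).
    """
    n = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    return [[1 if sum(s[r][c] for s in samples) / n >= threshold else 0
             for c in range(cols)]
            for r in range(rows)]

# Two talkers, five identical samples each, on a tiny 2 x 3 array.
talker_a = [[1, 1, 0], [0, 0, 0]]
talker_b = [[1, 0, 0], [0, 0, 1]]
ten_samples = [talker_a] * 5 + [talker_b] * 5

merged = merge_reference(ten_samples, threshold=0.5)
strict = merge_reference(ten_samples, threshold=0.6)
```

In this example the lenient setting keeps points contributed by only one of
the two talkers, while the stricter setting keeps only the point common to
both.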

At this point in the program, detailed studies were made of the reference
data arrays derived from the original 20-speaker training data. Comparisons
between the various talkers revealed that certain features (especially
maxima above the second vowel formant) which heretofore had been included
among the 32 recognition features varied significantly from speaker to speaker
for some words. Therefore, correlation matrices for the 20 talkers speaking
14 words were created using a new feature set. This set consisted of 10 maxima
instead of the original 17, plus 6 negative slope features and 16 phoneme
and class features. These new matrices were used, as before, for the manual
selection of reference data from five talkers for each word. Recognition
tests were then conducted for the same 14 talkers as were represented in the
previous test. The same test data base was used. Overall accuracy improved
from 96.9 percent to 97.2 percent.

The computer program used for calculating and printing correlation
scores described above was expanded to allow calculation of correlation scores
between the training data for all words of all talkers of the set stored on a
disk memory. With this program modification it was possible to test the
usefulness of the training data of a particular word in the VICI vocabulary
from a particular talker as a possible universal training sample, for a given
number of talkers. Figure 7 illustrates a correlation matrix resulting from
this program revision. The correlation scores for [illegible]
words 00 through 13 compared with word 3 (the digit three) for a [illegible]
in this figure. Columns in the figure are words, rows are talkers. Note
the relatively high scores in the column of word 3 as compared with the
remainder of the matrix. As would be expected, the correlation score for the digit
"three" for speaker 1 is the highest in the matrix because it is the
self-correlation [illegible] speaker 5 [illegible]
several times in a particular matrix, rendering that particular [illegible]
unsatisfactory. This procedure does not guarantee that [illegible]
be correctly recognized because this matrix does not consider the potential
[illegible] training data for other words. It does, however, provide an effective way to
reject possible training data samples which would obviously be unsuitable.
By the use of this "global" correlation technique, all words from
the original training data set [illegible] talkers were individually correlated.
[illegible] was then evaluated manually
and ranked for suitability as reference data. Another program modification
[illegible]
a new training sample. The merged data could then be put out on paper tape
to be used with the VICI system as reference data.

A 14 word reference array set was constructed by merging data from
the five highest ranking speakers for each word in the global correlation of
20 talkers mentioned above. This ranking was based upon examination of
correlation matrices such as that shown in Figure 7 for all 14 words of the VICI
vocabulary from 20 speakers (280 matrices total). The example shown in Figure
7 would be placed in the questionable category. Any other instances of a
score that is higher for a word other than the one for which the matrix was
generated put that word sample in the "bad" category.

[illegible] training data were merged [illegible]
training data set. At least one sample was taken from each speaker, with as
many as eight samples from certain speakers. A test [illegible] was conducted
[illegible] testing. Recognition accuracy for [illegible] in the training set was 97.45
[illegible] in the training [illegible] was 96.6 percent [illegible] percent.

Next, a 70-word reference array set was constructed [illegible] word were from
the same speakers [illegible] test just described. A test [illegible]
conducted with this 70 [illegible] test indicate [illegible]
set of talkers [illegible] subsequently was [illegible] talkers from the set by
the use of correlation matrices. This conclusion resulted from [illegible] the
results of another 29 talker test in which the reference [illegible] resulted
from merging the complete 20 talker training [illegible]. [illegible] previously
reported tests [illegible] a merge of five training sets is shown together with
the results of the two new tests.

Recognition accuracy (percent)

                                     5 Merged          5 Separate        20 Merged
                                     (14 Ref. Words)   (70 Ref. Words)   (14 Ref. Words)

20 talkers (in training set)         97.45             98.47             98.35

Nine talkers (not in training set)   96.6              97.65             97.73

29 talkers overall                   97.15             98.22             98.16

Because the results achieved with the 20 talker merge were the best,
all further tests during the program used merge techniques.


E. One Word Training Sample Experiments

As outlined before, the use of single word [illegible] for
[illegible] VICI vocabulary is precluded [illegible] period of five seconds has been
[illegible] set with a single training sample [illegible]. Several experiments
were conducted [illegible] single repetition train on the digits one, three and nine.

The 50 talkers were recorded on audio tape and the tests were run from
tape, subsequently. The test procedure did not exactly duplicate operational
conditions insofar as the use of training digits was concerned. The tapes
were recorded by each speaker with 50 four-digit groups spoken first, with
single digits from zero through nine spoken as five sets following the digit
groups. Table II is a list of the digit groups. In the tests with the three
training digits, for each speaker the VICI system was first trained with a
single sample for each of three digits taken from the sets of digits zero
through nine. The 50 digit groups then followed. Therefore, as far as the
speakers were concerned, the training digits did not directly precede the code
digits as they would in an operational situation. It can be reasonably inferred
that this test procedure would result in accuracy slightly inferior to
that realizable with a live input consisting of three training digits directly
followed by a four digit code. Test results appear to bear out such an inference.
Although in most cases recognition improved with the use of the three
training digits, a few speakers suffered slightly lower accuracy with the training
digits. Results of these tests are shown in Table III. This table shows
individual digit errors and corrections. In most cases there was only one
digit error per group of four so that group error totals are only slightly
lower than digit error totals. Any group of four digits in which any digit
was mispronounced or garbled was not counted at all, i.e., all other digits in
that group were ignored. Of a possible total of 2500 groups from 50 talkers,
2458 were usable groups. Recognition accuracy on a group basis went from 95.3
percent without training digits to 97.96 percent with three training digits.
Individual digit accuracy of usable groups went from 98.77 percent without to
99.92 with training digits. Figures 9 and 10 are error matrices for these tests.
Note that in order to show more clearly the error distribution, the correct
responses have been omitted. Figure 9 shows the error matrix for the digit
[illegible] errors involved the digits 1, 3 and 9. Therefore, the single repetition train
experiments with these three digits as the training digits could be expected
to show significantly reduced errors. The results shown in Figure 9 prove out
this assumption.

F.  Recognition  Networks  for  Universal  Speaker  Sets 

The phoneme-like feature recognition networks included in the final VICI
feature array previously discussed can be described by the use of logic
equations as shown in Table IV. These logic equations can be translated into
equivalent logic diagrams. The notational rules for these logic equations are as
follows:

1. An expression of the form $\left(\int_{T_1} X Q_1 - \int_{T_2} Y Q_2\right)$
indicates that the excitatory quantity $Q_1$ and the inhibitory
(subtractive) quantity $Q_2$ are integrated with time constants
$T_1$ and $T_2$ and employ gain factors X and Y, respectively.

2. The analytical expression for the binary AND function will be of
the form C = A · B, where C represents the digital output of the AND
gate for the two inputs A and B, which can be in analog or digital
form.
3. The expression for a logical OR function will be of the form C = A + B.

4. The summation symbol $\sum_{m}^{n} Q$ will be used to indicate a plurality of
(analog) input signals of the same type to an ATL element. In each
case Q represents the type of input signal, and m and n represent the
interval over which the feature is summed.

5. The networks or portions of networks which were constructed or modified
expressly for the VICI vocabulary are underlined with broken
lines.

TABLE II LIST OF 50 FOUR DIGIT GROUPS USED FOR TESTS WITH TRAINING DIGITS

TABLE III RESULTS OF SINGLE DIGIT TRAIN EXPERIMENTS
(column headings: Number of Speakers; Errors Deleted With Train; Errors Added
With Train; Net Digit Errors Before Train; Net Digit Errors After Train)

Error Matrix for 50 Speakers Speaking ... With Single Repetition Training on ...
(axes: SPOKEN, RECOGNIZED)

TABLE IV PHONEME-LIKE FEATURE RECOGNITION LOGIC EQUATIONS (SHEETS 1 THROUGH 4)

An  example  of  the  relationship  between  the  logic  diagram  and  the  logic 
equation  for  a particular  feature  recognition  network  is  shown  in  Fig.  11. 

The network shown in the figure was designed to recognize /ʃ/ in a conventional
VIP-100 preprocessor. This phoneme is not currently included in the VICI
feature array because it does not appear in the VICI vocabulary. The network
includes as inputs both binary and analog representations of positive slopes.
Design considerations for this network are explained in the following paragraph.

This fricative consonant is characterized in wide-band speech by broad
noise-like frequency bands above 1 kHz with a broad energy peak in the 2-3 kHz
region. The resultant primary features useful for the detection of this
characteristic are positive slopes (PSB) up through channels 10 or 11 of the VIP
preprocessor. This phoneme is separated from the similar fricative, /s/, by
the strength of positive slopes in channels 8 through 10 as compared with the
positive slopes in channels 12 through 14. The phoneme /ʃ/, with a lower
frequency concentration of energy in its spectrum as compared with /s/, can be
expected to have stronger slopes in channels 8 through 10 as compared with
channels 12 through 14. This separation is accomplished by the ATL element in the
network as shown in the figure. The integration time constant associated with
the inputs of the ATL element is 5 milliseconds. An input resistor value of
42.2K ohms is used for each input, resulting in a gain factor of 0.8 times for
each input. The unity gain input resistance for an ATL element is 34K ohms.
Lower values of input resistance will therefore result in gains of greater than
one. Binary representations of positive slopes in channels 4 through 10,
together with the unvoiced noise-like (UVNLC) feature typical of fricatives, are
ANDed together with the output of the ATL element.
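
The arithmetic and logic of this network can be sketched as follows. The
sketch is illustrative only. It assumes that the gain of an ATL input is the
unity-gain resistance divided by the input resistance, which agrees with the
two values quoted above (34K ohms for unity gain, 42.2K ohms for about 0.8).
The 5 millisecond integration is omitted, the binary slope inputs are reduced
to a single precomputed flag, and all function and parameter names are
hypothetical.

```python
UNITY_GAIN_KOHM = 34.0  # unity-gain input resistance quoted in the text

def input_gain(r_kohm):
    """Assumed relation: gain is inversely proportional to input resistance."""
    return UNITY_GAIN_KOHM / r_kohm

def atl_output(excitatory, inhibitory, gain, threshold=0.0):
    """Fire when weighted excitation exceeds weighted inhibition.

    Inputs are lists of already-integrated analog slope levels; the
    integration itself is not modeled here.
    """
    net = gain * sum(excitatory) - gain * sum(inhibitory)
    return net > threshold

def sh_feature(psb_4_to_10, uvnlc, slopes_8_to_10, slopes_12_to_14):
    """AND of the binary slope flag, the UVNLC flag and the ATL output."""
    gain = input_gain(42.2)
    return bool(psb_4_to_10 and uvnlc
                and atl_output(slopes_8_to_10, slopes_12_to_14, gain))

# Stronger slopes in channels 8-10 than in 12-14 suggest /ʃ/ rather than /s/.
like_sh = sh_feature(True, True, [0.9, 0.8, 0.7], [0.2, 0.1, 0.1])
like_s = sh_feature(True, True, [0.2, 0.1, 0.1], [0.9, 0.8, 0.7])
```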

Networks for the Zero Crossing feature included in the VICI feature array
and another special feature, labeled BRST in several equations, are not described
in Table IV because they are not easily expressible in logic equations. The
BRST network is a specialized arrangement of digital logic designed to minimize
the deleterious effects of a burst at the end of the digit 8 or the control
word TERMINATE. These bursts are quite variable with regard to spectrum and
frequency of occurrence from talker to talker. Therefore, a major effort was
successfully made to ignore the burst when it occurred.

G.  VICI  Software 

The software package developed for use with the VICI system was based upon
a standard VIP-100 program for recognizing a vocabulary of up to 76 words
with storage capacity for training data for one speaker at a time in the 8K of
core memory supplied with the system. Included as integral parts of this
program were the training algorithm, recognition algorithm and an output routine
for driving a 16 character Burroughs Self-Scan display and an ASR 33 Teletype.
This large vocabulary capacity allowed the experiments with the use of up to
five alternate reference arrays for each of the 14 words in the VICI vocabulary,
thus requiring in effect a 70 word vocabulary. Subsequent to these
experiments with multiple reference arrays, merging of reference arrays of a large
number of talkers was found to provide equal recognition accuracy for a large
number of speakers. The use of a small reference array set provides faster
inputting of data. The approximate recognition time of the VIP-100 software
for an input word is 100 ms with a vocabulary size of 14 and 300 ms with a
vocabulary size of 70. This time is processing time required by the minicomputer
after the cessation of the word and does not include the actual time
required to pronounce the word. The computer can accept a new input word during
the decision processing for a previous input. However, the larger size
vocabulary requires significantly greater processing time because of the
necessity in the recognition process of correlating the feature array of the
input word with each reference array entry. Each correlation requires
approximately 4 ms.
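
The two timing figures quoted above can be checked against the stated cost
per correlation with a short calculation. The sketch assumes, as a
simplification not stated in the report, that recognition time is a fixed
overhead plus a constant cost per reference array entry.

```python
def fit_line(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

# (vocabulary size, recognition time in ms) as quoted in the text.
per_entry_ms, fixed_ms = fit_line((14, 100.0), (70, 300.0))

def predicted_ms(vocabulary_size):
    """Recognition time under the assumed linear model."""
    return fixed_ms + per_entry_ms * vocabulary_size
```

Under this assumption the cost per entry comes to about 3.6 ms, consistent
with the figure of approximately 4 ms per correlation, and the remaining
fixed time is about 50 ms.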

A major modification of the conventional VIP-100 recognition software was
made to allow certain additional correlations to take place after the initial
recognition decision. These additional correlations, known as "second-look",
involve only the initial portion of the feature array of an input word and
selected reference arrays. To facilitate an explanation of this special
correlation, a review of the normal VIP-100 correlation process is in order.

After the input word is time normalized into 16 time slots, the resultant
array, which is 32 features wide (composed of digital ones and zeros), is
compared with each stored reference array. Similarities and dissimilarities in
each array are compared and appropriately weighted, and the net result provides
a weighted correlation product. Two other correlation products are produced
for each reference array after shifting the input array ± one time slot. The
highest correlation product for each stored reference array is then compared
with the highest products for each other word. The overall maximum product
decides which word is recognized.
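
The correlation process just reviewed can be sketched as follows. The actual
VIP-100 weighting of similarities and dissimilarities is not given in this
report, so the sketch assumes a weight of +1 for each point set in both
arrays and -1 for each point set in only one. It also assumes that the two
additional products come from shifting the input one slot earlier and one
slot later. Arrays are indexed as [time slot][feature], and all names are
hypothetical.

```python
def weighted_product(inp, ref, match_weight=1, mismatch_weight=1):
    """Assumed weighting: reward shared points, penalize disagreements."""
    score = 0
    for slot_i, slot_r in zip(inp, ref):
        for x, y in zip(slot_i, slot_r):
            if x and y:
                score += match_weight
            elif x != y:
                score -= mismatch_weight
    return score

def shift(inp, k):
    """Shift the array k time slots later (k < 0: earlier), zero filled."""
    n = len(inp)
    blank = [0] * len(inp[0])
    return [inp[t - k] if 0 <= t - k < n else blank for t in range(n)]

def best_product(inp, ref):
    """Highest of the three products: unshifted and shifted one slot each way."""
    return max(weighted_product(shift(inp, k), ref) for k in (-1, 0, 1))

def recognize(inp, references):
    """references maps each word to a list of reference arrays."""
    scores = {word: max(best_product(inp, ref) for ref in refs)
              for word, refs in references.items()}
    return max(scores, key=scores.get), scores

# Tiny example: 4 time slots x 3 features instead of 16 x 32.
ref_a = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
ref_b = [[0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 0, 0]]
late_input = shift(ref_a, 1)  # same pattern spoken one slot late
word, scores = recognize(late_input, {"one": [ref_a], "two": [ref_b]})
```

The shifted products allow an input that is misaligned by one time slot to
score as well against its reference as a perfectly aligned input.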

The second-look correlation routine, when made operative, correlates in
a similar manner the first five time slots only of the input array and a
selected reference array. The highest correlation product of this special
correlation is then multiplied by four and is added to the original product for the
particular reference array selected for the special correlation. This special
routine is used for selected words only. Second-look takes place only if the
initial correlation routine selects the input word as zero, one, five, eight,
or TERMINATE. For each of these words a different set of reference arrays is
involved. If the initial correlation choice is the digit zero, the input array
is recorrelated against the reference array for zero and for two, because most
confusions involving the digit zero have been with the digit two. Likewise,
if a one is recognized the second-look correlation occurs with the reference
arrays for one, four and five. If a five is recognized second-look occurs for
five and nine. If eight is chosen the reference arrays for eight, two, and
three are recorrelated. In a similar manner, TERMINATE initiates second-look
for TERMINATE and three. The second-look routine has been found to be quite
effective in increasing accuracy for some talkers.
Figures 12 and 13 illustrate by means of error matrices the extent of
such improvement. Figure 12 is the error matrix of a test without second-look
of 34 speakers reading the random word list shown in Table I. Figure 13
is a similar test of the same speakers and words with the second-look
configuration as shown above except for the TERMINATE-three combination, which was
not added until this test was completed. Also, a recognition logic change was
made which affected principally the control word TERMINATE. This change
resulted in fewer misrecognitions of TERMINATE in the second test but more
instances of the digit three being misrecognized as TERMINATE. The subsequent
addition of the TERMINATE-three combination to the second-look routine
virtually eliminated this confusion, as reference back to Figure 9 will disclose.
This figure is the error matrix for 50 speakers resulting from digit groups.

The second-look routine was found to be especially helpful in reducing
the number of times the digit four was misrecognized as one (21 times without
second-look, 3 times with). The number of 3-8 confusions and 9-5 confusions was
also reduced, as is illustrated by comparison of Figures 12 and 13.
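
The second-look rescoring can be sketched as follows. The word groupings and
the factor of four are taken from the description of the routine in this
section. The rule that the final choice is the highest rescored word among
the selected reference arrays only is an assumption, as are the function and
variable names and the scores in the example.

```python
# Initial choice -> words whose reference arrays are recorrelated.
SECOND_LOOK = {
    "zero": ["zero", "two"],
    "one": ["one", "four", "five"],
    "five": ["five", "nine"],
    "eight": ["eight", "two", "three"],
    "TERMINATE": ["TERMINATE", "three"],
}

def second_look(first_choice, full_scores, partial_scores):
    """Rescore the initial choice against its usual confusions.

    full_scores:    word -> original correlation product (all 16 slots)
    partial_scores: word -> highest product over the first five slots
    """
    candidates = SECOND_LOOK.get(first_choice)
    if candidates is None:
        return first_choice  # second-look does not apply to this word
    rescored = {w: full_scores[w] + 4 * partial_scores[w] for w in candidates}
    return max(rescored, key=rescored.get)

# Hypothetical scores: "one" narrowly wins at first, but the opening
# time slots favor "four".
full = {"one": 100, "four": 98, "five": 60}
partial = {"one": 10, "four": 14, "five": 5}
final = second_look("one", full, partial)
unchanged = second_look("two", {"two": 90}, {})
```

Weighting the first five time slots in this way emphasizes the beginning of
the word, where pairs such as one and four differ.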


Error Matrix for 34 Speakers ... Vocabulary List Without Second-Look

Section III


FINAL SYSTEM TESTS


A.  Background  of  Test  Data 

Final testing of the VICI system to establish performance levels was conducted
by the use of both tape recorded and live inputs from a total of 85
male talkers ranging in age from 16 years to 65 years. Tape recordings were
made of digits and control words spoken by 65 talkers over a period of eight
months from August 1974 to March 1975. In addition, special training data
recordings of 20 talkers, all TTI employees, were made in July 1974. This training
data, from which universal reference arrays were derived, [illegible]
repetitions of each VICI vocabulary word as spoken by each of the 20 speakers.
These same 20 talkers also later recorded independent test data. No less than
two weeks elapsed between the time any test and any training data were recorded
by the same talker. All recordings were made with Telex model 1200
noise-cancelling microphones. Figure 14 is a frequency response plot of one of the two
microphones used for these recordings. The other microphone had a similar
response.

The test data initially recorded were of the list of digits and control
words in random order as shown in Table I. Each speaker recording this list
read it two times, thus producing a total of 280 words. A total of 41 talkers
recorded this list, including the 20 who had previously recorded the original
training data. Data recorded by four Air Force employees of Wright-Patterson
Air Force Base in November 1973 for another contract were also used in tests
involving the Table I list. Those four Air Force employees spoke the digits
and two control words, ERASE and TERMINATE, 10 times per word in random order.
These latter four recordings were also made with a Telex 1200 microphone. The
list of 50 four-digit groups shown in Table II was recorded by 50 talkers,
including 18 of the 20 who recorded original training data and 30 of the 41 who
recorded the list of random digits and control words shown in Table I. For
the live tests, the four-digit group list was expanded to 75 groups as shown
in Table V. Two live tests were conducted, one at TTI prior to delivery
of the VICI system to RADC and one at RADC upon delivery. The first live test
was a 10 talker test. Nine of these 10 talkers were TTI employees, all of whom
had participated in recording original training data, in the recording of the
list of digits and control words in random order and in the recording of the
50 four-digit groups. The tenth participant in the live test at TTI was
an RADC representative, who supervised the test. The [illegible]
at RADC included 21 speakers, 20 of whom were civilian employees of RADC and
military personnel stationed at RADC. The twenty-first speaker in the test
at RADC was PS, the TTI project engineer for VICI, who also participated in
the live test at TTI.


B.  Final  Testing  From  Tape 


The final tests with tape recorded test data included two groups of data,
the list of digits and control words in random order, and the four-digit groups.
The final test with the former data culminated a series of tests with this data
base during the development of the system. The four-digit test conducted
subsequently simulated to an extent operational [illegible]
to verify and correct errors.

TABLE V LIST OF 75 FOUR-DIGIT GROUPS FOR [illegible] TTI AND AT RADC

1.  Random  Digit  and  Control  Word  Test  Results 

The final test of 280 words, digits and control words in random order,
was conducted shortly before the [illegible]
overall word accuracy without the word recognition [illegible]
word accuracy for 45 speakers was 98.83 percent. [illegible] digit recognition accuracy
[illegible] was 98.7 percent. Figure 15 is the error matrix resulting from
this test.

2.  Four-Digit  Group  Tests 

A test of 50 speakers each uttering 50 groups of four digits each
was conducted to simulate operating conditions [illegible] single word training
[illegible] experiments were conducted in conjunction with the final
test of four-digit groups. To recapitulate these [illegible]
training digits the single digit accuracy was [illegible]
as in the test described above. The four digit group accuracy was 95.3 percent.
Because the tests were conducted from tapes, there was no error correction.

C.  Final  Testing  with  Live  Inputs 

The final test of the VICI system with live inputs was conducted in two
[illegible] conducted [illegible] adjusted to [illegible]
incorporated in these tests. [illegible] correction for both tests, all speakers
[illegible] input all of the 75 digit groups shown in Table [illegible]
indicating to observers of the test that the speaker [illegible] verified
four-digit group [illegible] input to the BISS system [illegible] VERIFY
[illegible] four-digit group was entered correctly, thus
effectively making each group a five word entry. The statistics in Table IV
involving the time required for entering codes include the entry of the word
[illegible] codes correctly. The individual digit recognition accuracies of 98.24 percent
for the test at TTI and 97.75 percent for the test at RADC were in good agreement
with single digit accuracy from tape.

Section IV


CONCLUSIONS AND RECOMMENDATIONS


A.  Conclusions 

The  VICI  system  is  being  developed  as  a front  end  for  the  BISS  automatic 
speaker  verification  system  to  provide  a reliable  fully  automatic  means  for 
entering  speaker  verification  data.  The  VICI  has  demonstrated  high  accuracy 
capability  as  an  isolated-word  recognition  system  for  the  English  digits  plus 
four control words. Recognition accuracy for individual digits is approximately
98 percent without error correction or any individual adaptation for a
population  of  85  male  talkers  ranging  from  16  to  65  years  of  age.  The  design 
requirement  of  98  percent  accuracy  for  the  input  of  four-digit  code  groups 
with  error  correction  by  the  speaker  has  been  exceeded. 

In two live speaker tests, every one of a total of 30 speakers, each uttering
75 code groups of four digits each, was able, with the aid of error correction,
to input correctly all of the code groups, for 100 percent accuracy. The
average time required to speak and verify each code group and the word VERIFY
was 6.24 seconds for 21 talkers in one of these tests. Time was not recorded
for the other test. This average entry time included correction of misrecognized
digits when necessary. Error correction was accomplished by allowing
each speaker to view on a display the recognition decision immediately after
it was spoken (within 0.1 to 0.2 seconds). A recognition error could then be
corrected by saying the word ERASE, which deleted the incorrect digit, and saying
the digit again. Occasionally, speakers would pronounce several digits or
a complete group before realizing an initial error. In this instance, the group
could be deleted by saying CANCEL.
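
The correction procedure can be illustrated with a small sketch of the entry
logic. It is not the VICI software. Recognized digits are represented as
single characters, and the handling of VERIFY on an incomplete group and of
extra digits is assumed, since the report does not describe those cases. The
control word TERMINATE is omitted because its role is not described here.

```python
def enter_code(recognized_words, group_len=4):
    """Return the verified digit group, or None if none was verified.

    ERASE deletes the last digit, CANCEL deletes the whole group, and
    VERIFY accepts the group once it is complete.
    """
    digits = []
    for word in recognized_words:
        if word == "ERASE":
            if digits:
                digits.pop()
        elif word == "CANCEL":
            digits.clear()
        elif word == "VERIFY":
            if len(digits) == group_len:
                return "".join(digits)
        elif len(digits) < group_len:
            digits.append(word)  # assumed: digits beyond the group are ignored
    return None

# The second digit is misrecognized as 3, erased and spoken again.
corrected = enter_code(["7", "3", "ERASE", "8", "0", "2", "VERIFY"])
# An error noticed late: the whole group is cancelled and re-entered.
restarted = enter_code(["1", "2", "3", "CANCEL", "4", "5", "6", "7", "VERIFY"])
```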

The performance levels achieved by the VICI system were made possible by
a number of modifications to a basic VIP-100 speech recognition system
manufactured by Threshold Technology Inc. for commercial applications. The
modifications, in both hardware and software, were tailored to the limited
vocabulary set required. Extensive experimentation was conducted to determine the
optimum configuration of the reference array data to be used in the VICI system
for recognition of the digits and control words. The merging of training
arrays generated by several speakers to form a master reference array has been
found to be quite useful. The final VICI reference array set resulted from a
merge of training arrays from 20 speakers. This reference array set was then
used for all final testing.

B.  Recommendations 

The VICI system as presently constituted operates with a headband-mounted
noise-cancelling microphone connected by a high quality wire or radio link to
the preprocessor. For field use with the BISS system, it is likely that
operations under less than ideal conditions will be necessary. The operational
constraints could include a handheld microphone or telephone-handset microphone
with a 300 Hz to 3 kHz wire link connecting the microphone at a remote entry point
to the VICI system located at a central location. An investigation into the
use of a handheld microphone transmitting speech over a wire-line could result
in modification of the VICI to allow operation under such conditions.

The  VICI  system  has  been  developed  for  General  American  male  speakers. 
Operational  conditions  will  undoubtedly  include  inputs  by  females  as  well  as 
males. The VIP-100 speaker-dependent word recognition system upon which VICI
is  based  has  shown  excellent  performance  with  female  as  well  as  male  speakers 
in  numerous  commercial  applications.  Therefore,  modifications  of  the  VICI 
system  to  accept  female  as  well  as  male  speakers  should  not  be  difficult.  A 
possible  approach  to  fully  universal  speaker  operation  would  be  the  use  of  two 
or  more  alternate  reference  array  sets,  at  least  one  for  each  sex.  The  use  of 
alternate  reference  arrays  for  recognition  of  large  numbers  of  male  talkers 
has  been  successfully  tested  and  was  described  in  Section  II  of  this  report. 
These  techniques  should  be  extended  to  allow  male  and  female  talkers  to  use 
the  system. 
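
One way to picture recognition against alternate reference array sets is
sketched below. It assumes, purely for illustration, that each reference
pattern is a flat list of binary features and that the match score is the
number of agreeing positions; the names match_score and recognize, the set
labels, and the scoring rule are hypothetical and are not taken from the VICI
implementation.

```python
from typing import Dict, List, Tuple

Features = List[int]                # hypothetical flat binary feature pattern
ReferenceSet = Dict[str, Features]  # vocabulary word -> reference pattern


def match_score(sample: Features, reference: Features) -> int:
    """Count the positions where the sample agrees with the reference."""
    if len(sample) != len(reference):
        raise ValueError("sample and reference must have the same length")
    return sum(1 for s, r in zip(sample, reference) if s == r)


def recognize(sample: Features,
              reference_sets: Dict[str, ReferenceSet]) -> Tuple[str, str, int]:
    """Return (word, set_name, score) for the best match over all sets."""
    best = None
    for set_name, ref_set in reference_sets.items():
        for word, reference in ref_set.items():
            score = match_score(sample, reference)
            if best is None or score > best[2]:
                best = (word, set_name, score)
    if best is None:
        raise ValueError("no reference patterns supplied")
    return best


if __name__ == "__main__":
    sets = {
        "set_a": {"one": [1, 0, 1, 0], "four": [1, 1, 0, 0]},
        "set_b": {"one": [0, 0, 1, 1], "four": [0, 1, 0, 1]},
    }
    print(recognize([0, 0, 1, 1], sets))
```

Searching every set on every utterance keeps the talker from having to
declare which set applies, at the cost of more comparisons per word.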

The very high accuracy of the VICI system was enhanced by the use of manual
error correction by the speakers testing the system live. An alternative
to manual correction is the use of error-correcting codes and/or check
digits. A study of the use of such codes should be conducted in order to
minimize the amount of manual error correction necessary and thereby speed
the inputting of digits.
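
As an illustration of the check-digit idea only, the sketch below appends a
Luhn (mod 10) check digit to a digit group. The report does not name a
scheme; Luhn is swapped in here because it detects any single-digit
substitution, which is the kind of error a misrecognized digit produces. The
function names are hypothetical.

```python
def luhn_check_digit(digits: str) -> str:
    """Return the Luhn (mod 10) check digit for a string of decimal digits."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("input must be a non-empty string of digits 0-9")
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return str((10 - total % 10) % 10)


def is_valid(code: str) -> bool:
    """Check a digit group whose last digit is a Luhn check digit."""
    if len(code) < 2 or not (code.isascii() and code.isdigit()):
        return False
    return luhn_check_digit(code[:-1]) == code[-1]


if __name__ == "__main__":
    group = "1384"
    code = group + luhn_check_digit(group)
    print(code, is_valid(code))
    # Simulate a 1-4 confusion in the first digit of the spoken group.
    print(is_valid("4" + code[1:]))
```

A check digit of this kind only detects an error; the talker would still have
to repeat the group, whereas a true error-correcting code could repair some
errors automatically at the cost of more spoken digits per group.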

Certain digit confusions, such as 1-4 and 3-8, occurred in the final testing
with enough frequency to cause a few speakers some annoyance. Also, the use
of the word "niner" for the digit nine, which was done routinely in the
testing, has been deemed undesirable from an operational standpoint by
RADC project personnel. Therefore, these problem areas should be given special
attention in any program for improvement of the VICI system.

Rome Air Development Center

RADC is the principal AFSC organization charged with
planning and executing the USAF exploratory and advanced
development programs for information sciences, intelligence,
command, control and communications technology,
products and services oriented to the needs of the USAF.
Primary RADC mission areas are communications, electromagnetic
guidance and control, surveillance of ground
and aerospace objects, intelligence data collection and
handling, information system technology, and electronic
reliability, maintainability and compatibility. RADC
has mission responsibility as assigned by AFSC for
demonstration and acquisition of selected subsystems and
systems in the intelligence, mapping, charting, command,
control and communications areas.


